Search | VHL Regional Portal

1.

Automatic Detection of Intimate Partner Violence Victims from Social Media for Proactive Delivery of Support.

Guo, Yuting; Kim, Sangmi; Warren, Elise; Yang, Yuan-Chi; Lakamana, Sahithi; Sarker, Abeed.

AMIA Jt Summits Transl Sci Proc ; 2023: 254-260, 2023.

Article in English | MEDLINE | ID: mdl-37351791

ABSTRACT

Social media platforms are increasingly being used by intimate partner violence (IPV) victims to share experiences and seek support. If such information is automatically curated, it may be possible to conduct social media-based surveillance and even design interventions over such platforms. In this paper, we describe the development of a supervised classification system that automatically characterizes IPV-related posts on the social network Reddit. We collected data from four IPV-related subreddits and manually annotated the data to indicate whether a post is a self-report of IPV or not. Using the annotated data (N=289), we trained, evaluated, and compared supervised machine learning systems. A transformer-based classifier, RoBERTa, obtained the best classification performance with overall accuracy of 78% and IPV-self-report class F1-score of 0.67. Post-classification error analyses revealed that misclassifications often occur for posts that are very long or are non-first-person reports of IPV. Despite the relatively small annotated dataset, our classification methods obtained promising results, indicating that it may be possible to detect and, hence, provide support to IPV victims over Reddit.

2.

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection.

Sarker, Abeed; Lakamana, Sahithi; Liao, Ruqi; Abbas, Aamir; Yang, Yuan-Chi; Al-Garadi, Mohammed.

JMIR Infodemiology ; 3: e43694, 2023.

Article in English | MEDLINE | ID: mdl-37113382

ABSTRACT

Background: Social media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. Objective: Our objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. Methods: We created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. Results: FDA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed misinformation, information, political, and conspiracy theories to be prominent topics. 
Conclusions: Our proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network-based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.
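The intuition in the Methods above (a surge in a product's popularity shows up as an anomalous jump in chatter volume) can be illustrated with a minimal sketch. The abstract does not say which time-series anomaly detector was used, so the trailing-window z-score below is a stand-in technique, and the function name, window size, threshold, daily counts, and the pairing of a signal with a letter date are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def first_anomaly(counts, window=7, threshold=3.0):
    """Return the index of the first day whose count exceeds the trailing
    mean by more than `threshold` trailing standard deviations, or None."""
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor sigma at 1.0 so a perfectly flat history cannot divide by zero.
        sigma = max(pstdev(history), 1.0)
        if (counts[i] - mu) / sigma > threshold:
            return i
    return None


# Hypothetical daily mention counts for one product key phrase.
daily_mentions = [4, 5, 3, 6, 4, 5, 4, 5, 6, 40, 55, 60]
signal_day = first_anomaly(daily_mentions)

# Compare the signal date with a letter date (both chosen for illustration).
collection_start = date(2020, 2, 19)
signal_date = collection_start + timedelta(days=signal_day)
letter_date = date(2020, 3, 6)
lead_days = (letter_date - signal_date).days  # positive means the signal came first
```

A positive `lead_days` corresponds to the case counted in the Results as a signal detected earlier than the FDA letter.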

3.

Characteristics of Intimate Partner Violence and Survivor's Needs During the COVID-19 Pandemic: Insights From Subreddits Related to Intimate Partner Violence.

Kim, Sangmi; Warren, Elise; Jahangir, Tasfia; Al-Garadi, Mohammed; Guo, Yuting; Yang, Yuan-Chi; Lakamana, Sahithi; Sarker, Abeed.

J Interpers Violence ; 38(17-18): 9693-9716, 2023 09.

Article in English | MEDLINE | ID: mdl-37102576

ABSTRACT

Intimate partner violence (IPV) increased during the COVID-19 pandemic. Collecting actionable IPV-related data from conventional sources (e.g., medical records) was challenging during the pandemic, generating a need to obtain relevant data from non-conventional sources, such as social media. Social media, like Reddit, is a preferred medium of communication for IPV survivors to share their experiences and seek support with protected anonymity. Nevertheless, the scope of available IPV-related data on social media is rarely documented. Thus, we examined the availability of IPV-related information on Reddit and the characteristics of the reported IPV during the pandemic. Using natural language processing, we collected publicly available Reddit data from four IPV-related subreddits between January 1, 2020 and March 31, 2021. Of 4,000 collected posts, we randomly sampled 300 posts for analysis. Three individuals on the research team independently coded the data and resolved the coding discrepancies through discussions. We adopted quantitative content analysis and calculated the frequency of the identified codes. 36% of the posts (n = 108) constituted self-reported IPV by survivors, of which 40% regarded current/ongoing IPV, and 14% contained help-seeking messages. A majority of the survivors' posts reflected psychological aggression, followed by physical violence. Notably, 61.4% of the psychological aggression involved expressive aggression, followed by gaslighting (54.3%) and coercive control (44.3%). Survivors' top three needs during the pandemic were hearing similar experiences, legal advice, and validating their feelings/reactions/thoughts/actions. Albeit limited, data from bystanders (survivors' friends, family, or neighbors) were also available. Rich data reflecting IPV survivors' lived experiences were available on Reddit. Such information will be useful for IPV surveillance, prevention, and intervention.

Subject(s)

COVID-19 , Intimate Partner Violence , Humans , Pandemics , Intimate Partner Violence/psychology , Coercion , Survivors/psychology

4.

Can accurate demographic information about people who use prescription medications nonmedically be derived from Twitter?

Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Love, Jennifer S; Cooper, Hannah L F; Perrone, Jeanmarie; Sarker, Abeed.

Proc Natl Acad Sci U S A ; 120(8): e2207391120, 2023 02 21.

Article in English | MEDLINE | ID: mdl-36787355

ABSTRACT

Traditional substance use (SU) surveillance methods, such as surveys, incur substantial lags. Due to the continuously evolving trends in SU, insights obtained via such methods are often outdated. Social media-based sources have been proposed for obtaining timely insights, but methods leveraging such data cannot typically provide fine-grained statistics about subpopulations, unlike traditional approaches. We address this gap by developing methods for automatically characterizing a large Twitter nonmedical prescription medication use (NPMU) cohort (n = 288,562) in terms of age-group, race, and gender. Our natural language processing and machine learning methods for automated cohort characterization achieved 0.88 precision (95% CI:0.84 to 0.92) for age-group, 0.90 (95% CI: 0.85 to 0.95) for race, and 94% accuracy (95% CI: 92 to 97) for gender, when evaluated against manually annotated gold-standard data. We compared automatically derived statistics for NPMU of tranquilizers, stimulants, and opioids from Twitter with statistics reported in the National Survey on Drug Use and Health (NSDUH) and the National Emergency Department Sample (NEDS). Distributions automatically estimated from Twitter were mostly consistent with the NSDUH [Spearman r: race: 0.98 (P < 0.005); age-group: 0.67 (P < 0.005); gender: 0.66 (P = 0.27)] and NEDS, with 34/65 (52.3%) of the Twitter-based estimates lying within 95% CIs of estimates from the traditional sources. Explainable differences (e.g., overrepresentation of younger people) were found for age-group-related statistics. Our study demonstrates that accurate subpopulation-specific estimates about SU, particularly NPMU, may be automatically derived from Twitter to obtain earlier insights about targeted subpopulations compared to traditional surveillance approaches.

Subject(s)

Central Nervous System Stimulants , Social Media , Substance-Related Disorders , Humans , Substance-Related Disorders/epidemiology , Prescriptions , Demography

5.

Automatic Detection of Twitter Users Who Express Chronic Stress Experiences via Supervised Machine Learning and Natural Language Processing.

Yang, Yuan-Chi; Xie, Angel; Kim, Sangmi; Hair, Jessica; Al-Garadi, Mohammed; Sarker, Abeed.

Comput Inform Nurs ; 41(9): 717-724, 2023 Sep 01.

Article in English | MEDLINE | ID: mdl-36445331

ABSTRACT

Americans bear a high chronic stress burden, particularly during the COVID-19 pandemic. Although social media have many strengths to complement the weaknesses of conventional stress measures, including surveys, they have been rarely utilized to detect individuals self-reporting chronic stress. Thus, this study aimed to develop and evaluate an automatic system on Twitter to identify users who have self-reported chronic stress experiences. Using the Twitter public streaming application programming interface, we collected tweets containing certain stress-related keywords (eg, "chronic," "constant," "stress") and then filtered the data using pre-defined text patterns. We manually annotated tweets with (without) self-report of chronic stress as positive (negative). We trained multiple classifiers and tested them via accuracy and F1 score. We annotated 4195 tweets (1560 positives, 2635 negatives), achieving an inter-annotator agreement of 0.83 (Cohen's kappa). The classifier based on Bidirectional Encoder Representation from Transformers performed the best (accuracy of 83.6% [81.0-86.1]), outperforming the second best-performing classifier (support vector machines: 76.4% [73.5-79.3]). The past tweets from the authors of positive tweets contained useful information, including sources and health impacts of chronic stress. Our study demonstrates that users' self-reported chronic stress experiences can be automatically identified on Twitter, which has a high potential for surveillance and large-scale intervention.

Subject(s)

COVID-19 , Social Media , Humans , Natural Language Processing , Pandemics , Supervised Machine Learning
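The inter-annotator agreement reported in this abstract is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. A minimal sketch of the calculation follows; the function name and the two annotators' labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = self-report of chronic stress, 0 = not.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 8 of 10 items, but the kappa is noticeably lower than 0.8 because about half of that agreement would be expected by chance.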

6.

The Role of Natural Language Processing during the COVID-19 Pandemic: Health Applications, Opportunities, and Challenges.

Al-Garadi, Mohammed Ali; Yang, Yuan-Chi; Sarker, Abeed.

Healthcare (Basel) ; 10(11), 2022 Nov 12.

Article in English | MEDLINE | ID: mdl-36421593

ABSTRACT

The COVID-19 pandemic is the most devastating public health crisis in at least a century and has affected the lives of billions of people worldwide in unprecedented ways. Compared to pandemics of this scale in the past, societies are now equipped with advanced technologies that can mitigate the impacts of pandemics if utilized appropriately. However, opportunities are currently not fully utilized, particularly at the intersection of data science and health. Health-related big data and technological advances have the potential to significantly aid the fight against such pandemics, including the current pandemic's ongoing and long-term impacts. Specifically, the field of natural language processing (NLP) has enormous potential at a time when vast amounts of text-based data are continuously generated from a multitude of sources, such as health/hospital systems, published medical literature, and social media. Effectively mitigating the impacts of the pandemic requires tackling challenges associated with the application and deployment of NLP systems. In this paper, we review the applications of NLP to address diverse aspects of the COVID-19 pandemic. We outline key NLP-related advances on a chosen set of topics reported in the literature and discuss the opportunities and challenges associated with applying NLP during the current pandemic and future ones. These opportunities and challenges can guide future research aimed at improving the current health and social response systems and pandemic preparedness.

7.

Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification.

Guo, Yuting; Ge, Yao; Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Sarker, Abeed.

Healthcare (Basel) ; 10(8), 2022 Aug 05.

Article in English | MEDLINE | ID: mdl-36011135

ABSTRACT

Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performances in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks and performed better than the other models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. For pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvement in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and extended pretraining using SAPT and TSPT can further improve performance.

8.

Tracking the COVID-19 outbreak in India through Twitter: Opportunities for social media based global pandemic surveillance.

Lakamana, Sahithi; Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Sarker, Abeed.

AMIA Jt Summits Transl Sci Proc ; 2022: 313-322, 2022.

Article in English | MEDLINE | ID: mdl-35854749

ABSTRACT

We investigated the utility of Twitter for conducting multi-faceted geolocation-centric pandemic surveillance, using India as an example. We collected over 4 million COVID-19-related tweets related to the Indian outbreak between January and July 2021. We geolocated the tweets, applied natural language processing to characterize the tweets (e.g., identifying symptoms and emotions), and compared tweet volumes with the numbers of confirmed COVID-19 cases. Tweet numbers closely mirrored the outbreak, with the 7-day average strongly correlated with confirmed COVID-19 cases nationally (Spearman r=0.944; p=0.001), and also at the state level (Spearman r=0.84, p=0.0003). Fatigue, dyspnea, and cough were the top symptoms detected, while there was a significant increase in the proportion of tweets expressing negative emotions (e.g., fear and sadness). The surge in COVID-19 tweets was followed by an increased number of posts expressing concern about black fungus and oxygen supply. Our study illustrates the potential of social media for multi-faceted pandemic surveillance.
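The comparison described above (a 7-day average of tweet volume correlated with confirmed case counts using Spearman's rank correlation) can be sketched as follows. This is an illustration only: the helper names and the daily numbers are hypothetical, and the study's own tooling is not described in the abstract.

```python
def moving_average(values, window=7):
    """Trailing moving average; the output has len(values) - window + 1 points."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical daily tweet counts and confirmed cases over the same days.
tweets = [120, 135, 150, 190, 260, 340, 420, 510, 640, 700, 690, 650]
cases = [900, 950, 1100, 1300, 1800, 2500, 3100, 3900, 4800, 5300, 5200, 5000]
rho = spearman(moving_average(tweets), moving_average(cases))
```

Because Spearman's coefficient depends only on rank order, it measures whether the two smoothed series rise and fall together without assuming a linear relationship between tweet counts and case counts.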

9.

Large-Scale Social Media Analysis Reveals Emotions Associated with Nonmedical Prescription Drug Use.

Al-Garadi, Mohammed Ali; Yang, Yuan-Chi; Guo, Yuting; Kim, Sangmi; Love, Jennifer S; Perrone, Jeanmarie; Sarker, Abeed.

Health Data Sci ; 2022, 2022.

Article in English | MEDLINE | ID: mdl-37621877

ABSTRACT

Background: The behaviors and emotions associated with and reasons for nonmedical prescription drug use (NMPDU) are not well-captured through traditional instruments such as surveys and insurance claims. Publicly available NMPDU-related posts on social media can potentially be leveraged to study these aspects unobtrusively and at scale. Methods: We applied a machine learning classifier to detect self-reports of NMPDU on Twitter and extracted all public posts of the associated users. We analyzed approximately 137 million posts from 87,718 Twitter users in terms of expressed emotions, sentiments, concerns, and possible reasons for NMPDU via natural language processing. Results: Users in the NMPDU group express more negative emotions and less positive emotions, more concerns about family, the past, and body, and less concerns related to work, leisure, home, money, religion, health, and achievement compared to a control group (i.e., users who never reported NMPDU). NMPDU posts tend to be highly polarized, indicating potential emotional triggers. Gender-specific analyses show that female users in the NMPDU group express more content related to positive emotions, anticipation, sadness, joy, concerns about family, friends, home, health, and the past, and less about anger than males. The findings are consistent across distinct prescription drug categories (opioids, benzodiazepines, stimulants, and polysubstance). Conclusion: Our analyses of large-scale data show that substantial differences exist between the texts of the posts from users who self-report NMPDU on Twitter and those who do not, and between males and females who report NMPDU. Our findings can enrich our understanding of NMPDU and the population involved.

10.

A comparison of few-shot and traditional named entity recognition models for medical text.

Ge, Yao; Guo, Yuting; Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Sarker, Abeed.

IEEE Int Conf Healthc Inform ; 2022: 84-89, 2022 Jun.

Article in English | MEDLINE | ID: mdl-37641590

ABSTRACT

Many research problems involving medical texts have limited amounts of annotated data available (e.g., expressions of rare diseases). Traditional supervised machine learning algorithms, particularly those based on deep neural networks, require large volumes of annotated data, and they underperform when only small amounts of labeled data are available. Few-shot learning (FSL) is a category of machine learning models that are designed with the intent of solving problems that have small annotated datasets available. However, there is no current study that compares the performances of FSL models with traditional models (e.g., conditional random fields) for medical text at different training set sizes. In this paper, we attempted to fill this gap in research by comparing multiple FSL models with traditional models for the task of named entity recognition (NER) from medical texts. Using five health-related annotated NER datasets, we benchmarked three traditional NER models based on BERT: BERT-Linear Classifier (BLC), BERT-CRF (BC), and SANER; and three FSL NER models: StructShot & NNShot, Few-Shot Slot Tagging (FS-ST), and ProtoNER. Our benchmarking results show that almost all models, whether traditional or FSL, achieve significantly lower performances compared to the state-of-the-art with small amounts of training data. For the NER experiments we executed, the F1-scores were very low with small training sets, typically below 30%. FSL models that were reported to perform well on non-medical texts significantly underperformed, compared to their reported best, on medical texts. Our experiments also suggest that FSL methods tend to perform worse on data sets from noisy sources of medical texts, such as social media (which includes misspellings and colloquial expressions), compared to less noisy sources such as medical literature.
Our experiments demonstrate that the current state-of-the-art FSL systems are not yet suitable for effective NER in medical natural language processing tasks, and further research needs to be carried out to improve their performances. Creation of specialized, standardized datasets replicating real-world scenarios may help to move this category of methods forward.

11.

Thematic Analysis of Reddit Content About Buprenorphine-naloxone Using Manual Annotation and Natural Language Processing Techniques.

Graves, Rachel Lynn; Perrone, Jeanmarie; Al-Garadi, Mohammed Ali; Yang, Yuan-Chi; Love, Jennifer S; O'Connor, Karen; Gonzalez-Hernandez, Graciela; Sarker, Abeed.

J Addict Med ; 16(4): 454-460, 2022.

Article in English | MEDLINE | ID: mdl-34864788

ABSTRACT

BACKGROUND: Opioid use disorder (OUD) is a major public health crisis for which buprenorphine-naloxone is an effective evidence-based treatment. Analysis of Reddit data yields detailed information about firsthand experiences with buprenorphine-naloxone that has the potential to inform treatment of OUD. METHODS: We conducted a thematic analysis of posts about buprenorphine-naloxone from a Reddit forum in which Reddit users anonymously discuss topics related to opioid use. We used an application programming interface to retrieve posts about buprenorphine-naloxone, then applied natural language processing to generate meta-information and curate samples of salient posts. We manually categorized posts according to their content and conducted natural language processing-aided analysis of posts about buprenorphine tapering strategies, withdrawal symptoms, and adjunctive substances/behaviors useful in the tapering process. RESULTS: A total of 16,146 posts from 1933 redditors were retrieved from the /r/suboxone subreddit. Thematic analysis of sample posts (N = 200) revealed descriptions of personal experiences (74%), nonpersonal accounts (24%), and other content (2%). Among redditors who reported tapering to termination (N = 40), 0.063 mg and 0.125 mg were the most common termination doses. Fatigue, gastrointestinal disturbance, and mood disturbance were the most frequent adverse effects, and loperamide and vitamins/dietary supplements were the most frequently discussed adjunctive substances/behaviors. CONCLUSIONS: Discussions on Reddit are rich in information about buprenorphine-naloxone. Information derived from analysis of Reddit posts about buprenorphine-naloxone may not be available elsewhere and may help providers improve treatment of people with OUD through better understanding of the experiences of people who have used buprenorphine-naloxone.

Subject(s)

Buprenorphine , Opioid-Related Disorders , Substance Withdrawal Syndrome , Buprenorphine/therapeutic use , Buprenorphine, Naloxone Drug Combination/therapeutic use , Humans , Narcotic Antagonists/therapeutic use , Natural Language Processing , Opioid-Related Disorders/drug therapy , Substance Withdrawal Syndrome/drug therapy

12.

Natural language model for automatic identification of Intimate Partner Violence reports from Twitter.

Al-Garadi, Mohammed Ali; Kim, Sangmi; Guo, Yuting; Warren, Elise; Yang, Yuan-Chi; Lakamana, Sahithi; Sarker, Abeed.

Array (N Y) ; 15, 2022 Sep.

Article in English | MEDLINE | ID: mdl-37006948

ABSTRACT

Intimate partner violence (IPV) is a preventable public health problem that affects millions of people worldwide. Approximately one in four women are estimated to be or have been victims of severe violence at some point in their lives, irrespective of age, ethnicity, and economic status. Victims often report IPV experiences on social media, and automatic detection of such reports via machine learning may enable improved surveillance and targeted distribution of support and/or interventions for those in need. However, no artificial intelligence systems for automatic detection currently exist, and we attempted to address this research gap. We collected posts from Twitter using a list of IPV-related keywords, manually reviewed subsets of retrieved posts, and prepared annotation guidelines to categorize tweets into IPV-report or non-IPV-report. We annotated 6,348 tweets in total, with an inter-annotator agreement (IAA) of 0.86 (Cohen's kappa) among 1,834 double-annotated tweets. The class distribution in the annotated dataset was highly imbalanced, with only 668 posts (~11%) labeled as IPV-report. We then developed an effective natural language processing model to identify IPV-reporting tweets automatically. The developed model achieved classification F1-scores of 0.76 for the IPV-report class and 0.97 for the non-IPV-report class. We conducted post-classification analyses to determine the causes of system errors and to ensure that the system did not exhibit biases in its decision making, particularly with respect to race and gender. Our automatic model can be an essential component for a proactive social media-based intervention and support framework, while also aiding population-level surveillance and large-scale cohort studies.
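The abstract reports a separate F1-score for each class, which matters because the dataset is highly imbalanced: a classifier can score well on the majority class while doing worse on the minority one. A minimal sketch of the per-class calculation follows; the function name and the toy labels are hypothetical and do not reproduce the study's data or results.

```python
def f1_for_class(y_true, y_pred, target):
    """F1-score treating `target` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 1 = IPV-report, 0 = non-IPV-report.
gold = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
f1_report = f1_for_class(gold, predicted, target=1)
f1_non_report = f1_for_class(gold, predicted, target=0)
```

With one false positive and one false negative, the minority class in this toy example loses more F1 than the majority class, which mirrors the gap between the two class scores in the abstract.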

13.

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts.

Sarker, Abeed; Al-Garadi, Mohammed Ali; Yang, Yuan-Chi; Choi, Jinho; Quyyumi, Arshed A; Martin, Greg S.

JMIR Med Inform ; 9(9): e18471, 2021 Sep 28.

Article in English | MEDLINE | ID: mdl-34581670

ABSTRACT

The capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R&D) approaches may be described as problem oriented, and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R&D efforts need to be broadened to a new research paradigm, one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems' suitability for POCRC (3 that represent NLP system properties and 1 associated with the R&D process): (1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize diverse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an example, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.

14.

Automatic gender detection in Twitter profiles for health-related cohort studies.

Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Love, Jennifer S; Perrone, Jeanmarie; Sarker, Abeed.

JAMIA Open ; 4(2): ooab042, 2021 Apr.

Article in English | MEDLINE | ID: mdl-34169232

ABSTRACT

OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users' demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and we illustrate how gender information can be incorporated into a social media-based health-related study. MATERIALS AND METHODS: We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users' information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system's utility. RESULTS AND DISCUSSION: We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0-94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0-96.6%]). Including automatically classified information in the analyses of Dataset-2 revealed gender-specific trends: proportions of females closely resemble data from the National Survey of Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45), and the overdose Emergency Room Visit due to Opioids by Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37).
CONCLUSION: Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public).

15.

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid.

Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Bremer, Whitney; Zhu, Jane M; Grande, David; Sarker, Abeed.

J Med Internet Res ; 23(5): e26616, 2021 05 03.

Article in English | MEDLINE | ID: mdl-33938807

ABSTRACT

BACKGROUND: The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers' perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. OBJECTIVE: This study aims to develop and evaluate a system that uses natural language processing and machine learning to automatically characterize user-posted Twitter data about health services, using Medicaid, the single largest source of health coverage in the United States, as an example. METHODS: We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website's search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets into 5 predetermined categories or as other, and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. RESULTS: We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second-best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. CONCLUSIONS: The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.

Subject(s)

Social Media; Bayes Theorem; Health Services; Humans; Medicaid; Natural Language Processing; United States
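The split percentages and the per-class F1 score reported in the abstract above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names and the toy labels in the usage note are hypothetical, and only the tweet counts (7930, 1449, 2000) come from the abstract.

```python
def split_shares(train, val, test):
    """Return the total and the percentage share of each split, rounded to 1 decimal."""
    total = train + val + test
    return total, [round(100 * n / total, 1) for n in (train, val, test)]


def f1_for_class(gold, pred, label):
    """F1 score for one class, computed from parallel gold and predicted labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Counts taken from the abstract; everything else is illustrative.
print(split_shares(7930, 1449, 2000))  # (11379, [69.7, 12.7, 17.6])
```

For example, with the hypothetical labels gold = ["feedback", "other", "feedback", "political"] and pred = ["feedback", "feedback", "other", "political"], the "feedback" class has one true positive, one false positive, and one false negative, so f1_for_class returns 0.5.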

16.

Text classification models for the automatic detection of nonmedical prescription medication use from social media.

Al-Garadi, Mohammed Ali; Yang, Yuan-Chi; Cai, Haitao; Ruan, Yucheng; O'Connor, Karen; Gonzalez-Hernandez, Graciela; Perrone, Jeanmarie; Sarker, Abeed.

BMC Med Inform Decis Mak ; 21(1): 27, 2021 01 26.

Article in English | MEDLINE | ID: mdl-33499852

ABSTRACT

BACKGROUND: Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging, requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter. METHODS: We experimented with state-of-the-art bidirectional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, ALBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning approaches, including deep learning approaches. Using a public dataset, we evaluated the performances of the classifiers on their abilities to classify the non-majority "abuse/misuse" class. RESULTS: Our proposed fusion-based model performs significantly better than the best traditional model (F1-score [95% CI]: 0.67 [0.64-0.69] vs. 0.45 [0.42-0.48]). We illustrate, via experimentation using varying training set sizes, that the transformer-based models are more stable and require less annotated data compared to the other models. The significant improvements achieved by our best-performing classification model over past approaches make it suitable for automated continuous monitoring of nonmedical PM use from Twitter. CONCLUSIONS: BERT, BERT-like and fusion-based models outperform traditional machine learning and deep learning models, achieving substantial improvements over many years of past research on the topic of prescription medication misuse/abuse classification from social media, which had been shown to be a complex task due to the unique ways in which information about nonmedical use is presented. Several challenges associated with the lack of context and the nature of social media language need to be overcome to further improve BERT and BERT-like models. These experimentally driven challenges are presented as potential future research directions.

Subject(s)

Prescription Drugs; Social Media; Humans; Machine Learning; Natural Language Processing; Prescriptions
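The abstract above reports F1 scores with 95% confidence intervals but does not say how the intervals were computed. One common option is a percentile bootstrap over the test set, sketched below under that assumption. The function names, the number of resamples, and the seed are all hypothetical choices, not details from the paper.

```python
import random


def f1_binary(gold, pred, positive=1):
    """F1 score for the positive class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_ci(gold, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping gold/pred pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_binary([gold[i] for i in idx], [pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Calling bootstrap_f1_ci(gold, pred) on a model's test-set labels and predictions returns a (lower, upper) pair. Non-overlapping intervals for two models, as in the 0.67 [0.64-0.69] vs. 0.45 [0.42-0.48] comparison, are a conservative indication that the difference is not due to sampling noise alone.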

17.

Self-reported COVID-19 symptoms on Twitter: an analysis and a research resource.

Sarker, Abeed; Lakamana, Sahithi; Hogg-Bremer, Whitney; Xie, Angel; Al-Garadi, Mohammed Ali; Yang, Yuan-Chi.

J Am Med Inform Assoc ; 27(8): 1310-1315, 2020 08 01.

Article in English | MEDLINE | ID: mdl-32620975

ABSTRACT

OBJECTIVE: To mine Twitter and quantitatively analyze COVID-19 symptoms self-reported by users, compare symptom distributions across studies, and create a symptom lexicon for future research. MATERIALS AND METHODS: We retrieved tweets using COVID-19-related keywords, and performed semiautomatic filtering to curate self-reports from users who tested positive. We extracted COVID-19-related symptoms mentioned by the users, mapped them to standard concept IDs in the Unified Medical Language System, and compared the distributions to those reported in early studies from clinical settings. RESULTS: We identified 203 users who tested positive and who reported 1002 symptoms using 668 unique expressions. Among users who reported at least 1 symptom, the most frequently reported symptoms were fever/pyrexia (66.1%), cough (57.9%), body ache/pain (42.7%), fatigue (42.1%), headache (37.4%), and dyspnea (36.3%). Mild symptoms, such as anosmia (28.7%) and ageusia (28.1%), were frequently reported on Twitter, but not in clinical studies. CONCLUSION: The spectrum of COVID-19 symptoms identified from Twitter may complement those identified in clinical settings.

Subject(s)

Coronavirus Infections; Pandemics; Pneumonia, Viral; Self Report; Social Media; Symptom Assessment; Betacoronavirus; COVID-19; Coronavirus Infections/complications; Coronavirus Infections/diagnosis; Data Mining; Humans; Natural Language Processing; Pneumonia, Viral/complications; Pneumonia, Viral/diagnosis; SARS-CoV-2
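The symptom percentages in the abstract above are computed over users who reported at least 1 symptom. A minimal sketch of that calculation follows. It assumes the symptom mentions have already been normalized to one label per concept, and the function name and input structure are hypothetical.

```python
from collections import Counter


def symptom_percentages(user_symptoms):
    """Share of symptom-reporting users who mention each symptom.

    user_symptoms maps a user ID to a list of normalized symptom labels.
    Users with an empty list are excluded from the denominator, and each
    symptom is counted at most once per user.
    """
    reporting = {user: set(syms) for user, syms in user_symptoms.items() if syms}
    counts = Counter(sym for syms in reporting.values() for sym in syms)
    n_users = len(reporting)
    return {sym: round(100 * c / n_users, 1) for sym, c in counts.items()}
```

For instance, if three of four users report symptoms and all three mention fever, fever is assigned 100.0% rather than 75.0%, because the user with no reported symptoms is left out of the denominator.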

18.

A Light-Weight Text Summarization System for Fast Access to Medical Evidence.

Sarker, Abeed; Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Abbas, Aamir.

Front Digit Health ; 2: 585559, 2020.

Article in English | MEDLINE | ID: mdl-34713057

ABSTRACT

As the volume of published medical research continues to grow rapidly, staying up-to-date with the best-available research evidence regarding specific topics is becoming an increasingly challenging problem for medical experts and researchers. The current COVID-19 pandemic is a good example of a topic on which research evidence is rapidly evolving. Automatic query-focused text summarization approaches may help researchers to swiftly review research evidence by presenting salient and query-relevant information from newly published articles in a condensed manner. Typical medical text summarization approaches require domain knowledge, and the performances of such systems rely on resource-heavy medical domain-specific knowledge sources and pre-processing methods (e.g., text classification) for deriving semantic information. Consequently, these systems are often difficult to speedily customize, extend, or deploy in low-resource settings, and they are often operationally slow. In this paper, we propose a fast and simple extractive summarization approach that can be easily deployed and run, and may thus help medical experts and researchers obtain fast access to the latest research evidence. At runtime, our system utilizes similarity measurements derived from pre-trained medical domain-specific word embeddings in addition to simple features, rather than computationally expensive pre-processing and resource-heavy knowledge bases. Automatic evaluation using ROUGE, a summary evaluation tool, on a public dataset for evidence-based medicine shows that our system's performance, despite the simple implementation, is statistically comparable with the state-of-the-art. Extrinsic manual evaluation based on recently released COVID-19 articles demonstrates that the summarizer performance is close to human agreement, which is generally low, for extractive summarization.
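The core idea in the abstract above, ranking sentences by embedding similarity to a query, can be illustrated with a short sketch. This is not the authors' system: it covers only the similarity component (not the additional simple features), averages word vectors as an assumed sentence representation, and uses a tiny hypothetical vector table in place of pre-trained medical word embeddings.

```python
import math

# Hypothetical 2-dimensional vectors standing in for pre-trained embeddings.
TOY_VECTORS = {
    "symptoms": [0.8, 0.2],
    "fever": [1.0, 0.0],
    "cough": [0.9, 0.1],
    "funding": [0.1, 0.9],
    "budget": [0.0, 1.0],
}


def embed(text, vectors):
    """Average the vectors of known words; return None if no word is known."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(query, sentences, vectors, k=1):
    """Return the k sentences most similar to the query, in document order."""
    q = embed(query, vectors)
    scored = []
    for pos, sentence in enumerate(sentences):
        v = embed(sentence, vectors)
        score = cosine(q, v) if q is not None and v is not None else 0.0
        scored.append((score, pos))
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    return [sentences[pos] for _, pos in sorted(top, key=lambda t: t[1])]
```

With the query "COVID symptoms" and the two sentences "Funding and budget were discussed." and "Patients reported fever and cough.", summarize selects the second sentence, since its averaged vector points in nearly the same direction as the query's. Because nothing beyond a vector lookup and a few dot products runs per sentence, this kind of scoring stays cheap at runtime.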
